Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Packaging import through producers platform #8207

Merged
merged 68 commits into from
Apr 4, 2023
Merged

Conversation

stephanegigandet
Copy link
Contributor

@stephanegigandet stephanegigandet commented Mar 15, 2023

This PR enables producers to send us detailed packaging import data through CSV / Excel files uploaded on the producers platform.

The default is for producers to send fields like "packaging 1 shape", "packaging 1 material" etc. for each packaging component, with separate columns for each component.

At least one big producer (Les Mousquetaires / Intermarché) is sending us data with multiple lines for one product (one for each packaging component), so we now have a mechanism to support this as well.

Also included changes:

  • Extended packaging shapes and materials taxonomies, to support values sent from some producers
  • New packaging-shapes, packaging-materials and packaging-recycling facets, that are very useful to see if we can correctly map producer data to our taxonomies. Those are populated from the packagings data structure.
  • New feature in Tags.pm canonicalize_taxonomy_tag() now recognizes entries like "Parent / Child" and "Synonym 1 / Synonym 2" (respectively mapped to the child, and to the entry that matches both synonyms)
  • Remove the import of packaging data from GS1 (we only had one single shape for all of the product, the data is often incorrect. GS1 now has a new much improved format for packaging data, that we can add support for)
  • Fix for 2 exact same packaging components added through API are conflated into one #8197
  • Some refactoring (e.g. deduplicating regular expressions used to process imported data)
  • A lot of tests

@stephanegigandet stephanegigandet requested a review from a team as a code owner March 15, 2023 09:59
@stephanegigandet stephanegigandet marked this pull request as draft March 15, 2023 09:59
@github-actions github-actions bot added Data import 🌱 Green-Score https://world.openfoodfacts.org/eco-score-the-environmental-impact-of-food-products 🧪 integration tests 🏭 Producers Platform https://wiki.openfoodfacts.org/Platform_for_producers 🧪 tests Translations We use a non-standard version of GetText, lack language variants support translate.openfoodfacts.org labels Mar 20, 2023
@github-actions github-actions bot added the STO label Mar 21, 2023
@github-actions github-actions bot added the 💥 Merge Conflicts 💥 Merge Conflicts label Mar 23, 2023
Copy link
Member

@alexgarel alexgarel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work !

I'll let you see my (minor) comments.

Comment on lines 1125 to 1126
# packaging data is specified in the CSV file in columns named like packagings_1_number_of_units
for (my $i = 1; $i <= 10; $i++) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# packaging data is specified in the CSV file in columns named like packagings_1_number_of_units
for (my $i = 1; $i <= 10; $i++) {
# packaging data is specified in the CSV file in columns named like packagings_1_number_of_units
# we currently search up to 10 components
$IMPORT_MAX_COMPONENTS = 10;
for (my $i = 1; $i <= $IMPORT_MAX_COMPONENTS; $i++) {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea

lib/ProductOpener/Import.pm Show resolved Hide resolved
Comment on lines +1153 to +1155
if ($data_is_complete) {
# We seem to have complete data, replace existing data
$product_ref->{packagings} = \@input_packagings;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

$data_is_complete only tells that you have at least one complete line. Is this enough to consider complete ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Merging packaging data is very tricky and very likely to generate duplicates, so if we have weights from the producer, for at least one component, I think it's better to replace the whole structure.

Comment on lines +132 to +135
$empty_regexp = '(?:,|\%|;|_|°|-|\/|\\|\.|\s)*';
$unknown_regexp = 'unknown|inconnu|inconnue|non renseigné(?:e)?(?:s)?|nr|n\/r';
$not_applicable_regexp = 'n(?:\/|\\|\.|-)?a(?:\.)?|(?:not|non)(?: |-)applicable|no aplica';
$none_regexp = 'none|aucun|aucune|aucun\(e\)';
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

really cool.

lib/ProductOpener/Producers.pm Show resolved Hide resolved

# TODO verify images
# clean csv and sto
unlink $inputs_dir . "eco-score-template.xlsx.csv";
unlink $inputs_dir . "test.columns_fields.sto";
rmdir remove_tree($outputs_dir);
#rmdir remove_tree($outputs_dir);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shan't we remove outputs dir ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah yes, I commented it for debugging and forgot it

@@ -770,6 +769,23 @@
"origin" : "france",
"origin_en" : "france",
"other_nutritional_substances_tags" : [],
"packaging_materials_tags" : [
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we have all those changes ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a new feature, I added new packaging_(shapes|materials|recycling) facets in order to be able to better see what producers will send us, and what needs to be added to the taxonomies.

create_sto_from_json($columns_fields_json, $columns_fields_file);

# step3 convert file
my $converted_file = $outputs_dir . "test.converted.csv";
my $conv_result;
($out, $err, $conv_result) = capture_ouputs(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I did use capture_outputs because I was a bit annoyed by the long logs when running test that make them really hard to exploit…

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The issue I have with it is that if the code inside dies, then the test actually passes, and it's very hard to understand what happened.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you can fix it and keep the capture it would be cool :-)

tests/unit/packaging.t Outdated Show resolved Hide resolved
Comment on lines +19 to +25
foreach my $l (qw(en fr es)) {
compare_to_expected_results(
init_fields_columns_names_for_lang($l),
$expected_result_dir . "/column_names_$l.json",
$update_expected_results
);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😎 Cool addition.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well it made me realize that the generated hash tables are huge, and that we probably need a better solution than generating zillions of column synonyms. Maybe do some universal synonyms normalization first.

@github-actions github-actions bot removed the 💥 Merge Conflicts 💥 Merge Conflicts label Mar 30, 2023
stephanegigandet and others added 7 commits March 30, 2023 18:28
Co-authored-by: Alex Garel <alex@garel.org>
Co-authored-by: Alex Garel <alex@garel.org>
Co-authored-by: Alex Garel <alex@garel.org>
Co-authored-by: Alex Garel <alex@garel.org>
Co-authored-by: Alex Garel <alex@garel.org>
@stephanegigandet stephanegigandet changed the title feat: Packaging import through producers platform (wip) feat: Packaging import through producers platform Mar 30, 2023
@github-actions github-actions bot added the Text label Mar 31, 2023
@sonarqubecloud
Copy link

sonarqubecloud bot commented Apr 3, 2023

Kudos, SonarCloud Quality Gate passed!    Quality Gate passed

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities
Security Hotspot A 0 Security Hotspots
Code Smell A 0 Code Smells

No Coverage information No Coverage information
0.0% 0.0% Duplication

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Issues related to the Open Food Facts API. More specific labels exist & should be used (API WRITE…) Data import 🌱 Green-Score https://world.openfoodfacts.org/eco-score-the-environmental-impact-of-food-products GS1 The producer platform is integrating with the GS1 product data formats. GS1 manages barcodes. 🧪 integration tests 📖 Knowledge Panels https://wiki.openfoodfacts.org/Knowledge_panels 🧴 Open Beauty Facts Our cosmetic analysis project https://world.openbeautyfacts.org 📦 Packaging https://wiki.openfoodfacts.org/Category:Recycling 🏭 Producers Platform https://wiki.openfoodfacts.org/Platform_for_producers STO Tags 🧬 Taxonomies https://wiki.openfoodfacts.org/Global_taxonomies Template::Toolkit The templating toolkit used by product opener. The starting point for HTML/JS/CSS fixes. 🧪 tests Text Translations We use a non-standard version of GetText, lack language variants support translate.openfoodfacts.org 🧪 unit tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants